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DETECTING AND HANDLING OUTLYING TRAJECTORIES IN 
IRREGULARLY SAMPLED FUNCTIONAL DATASETS 

By Daniel Gervini 1 

University of Wisconsin-Milwaukee 

Outlying curves often occur in functional or longitudinal datasets, 
and can be very influential on parameter estimators and very hard 
to detect visually. In this article we introduce estimators of the mean 
and the principal components that are resistant to, and then can be 
used for detection of, outlying sample trajectories. The estimators are 
based on reduced-rank f-models and are specifically aimed at sparse 
and irregularly sampled functional data. The outlier-resistance prop- 
erties of the estimators and their relative efficiency for noncontami- 
nated data are studied theoretically and by simulation. Applications 
to the analysis of Internet traffic data and glycated hemoglobin levels 
in diabetic children are presented. 

1. Introduction. In many statistical problems the collected data consists 
of samples of stochastic processes rather than scalars or vectors. Typical 
examples include human growth curves and circadian rhythms in medicine, 
time-dependent gene expression profiles in genomics, and spectral curves 
in chemometrics. Other examples and an overview of the related statistical 
methodology can be found in Ramsay and Silverman (2005). 

As with univariate or multivariate samples, the presence of atypical ob- 
servations in functional samples tends to complicate the statistical analysis. 
By atypical observations we mean atypical curves, not just isolated points. 
To illustrate the problem, consider the following two examples. The first 
one is a problem on Internet traffic analysis. The data, previously analyzed 
by Zhang et al. (2007), was collected at the main Internet link of the Uni- 
versity of North Carolina campus network during seven consecutive weeks. 
The traffic is measured in packet counts, every half an hour; the logarithm 
of the data for the 35 week days is shown in Figure 1(a). Most trajectories, 
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Fig. 1. Internet traffic data. Trajectories for (a,) 35 week days and (h) 14 weekend days. 

while noisy, show a clear daily pattern: the traffic rises sharply between 7 
and 9 a.m., remains at approximately the same level between 9 a.m. and 4 
p.m., and goes down again between 4 and 7 p.m. However, there is a clearly 
atypical curve, a day with unusually low traffic, and another one less con- 
spicuous but still atypical, corresponding to a day when the traffic peaked 
earlier than usual in the morning. The problems created by these atypical 
curves, and how to deal with them, will be discussed more extensively in 
Section 6. 
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Fig. 2. Child diabetes data. Trajectories of Hoaic levels for (&) 73 females and (b) 66 
males. 
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The second example is more complicated. Figure 2 shows trajectories of 
glycated hemoglobin levels for diabetic children who underwent treatment 
at the Children's Hospital of the University of Zurich. The level of glycated 
hemoglobin (abbreviated HbAic) is used to assess the effectiveness of therapy 
in patients with type-I diabetes mellitus, and to study the long-term effect 
of the disease on physical and intellectual development [see, e.g., Schoenle 
et al. (2002)]. One trajectory is clearly out of control in Figure 2(a), but 
besides that, it is hard to discern any systematic patterns in the data. To 
complicate the problem, HbAic levels are measured at irregular time points, 
with as few as 2 observations for some individuals. This makes individual 
smoothing of the trajectories (which would have eased visualization) very 
hard or even impossible. This example will also be discussed in more detail 
in Section 6, but it is clear that atypical curves cannot always be detected 
by visual inspection, and one must rely on methods that can handle outlying 
curves automatically. 

These examples also show that outliers, in the functional sense, are not 
simply the result of misrecorded data or extreme noise. They correspond to 
individuals that, for some reason, do not follow the pattern of the majority 
of the data, and often deserve to be studied more carefully rather than 
simply discarded. However, these atypical curves must be downweighted at 
the estimation step, or they may lead to erroneous conclusions for the rest 
of the population. 

This article is organized as follows. Section 2 frames the discussion in a 
more rigorous statistical setting, as an estimation problem for stochastic pro- 
cesses. Section 3 proposes an outlier-resistant estimation method, and Sec- 
tions 4 and 5 discuss their asymptotic and finite-sample properties. Section 
6 presents a more thorough analysis of the above examples. Available as sup- 
plementary material are a Technical Report with proofs and mathematical 
derivations, and Matlab programs implementing the proposed estimators. 

2. Functional data models. The data in the examples above and in simi- 
lar longitudinal studies can be thought of as discrete observations of continuous- 
time stochastic processes (or, more generally, of stochastic processes depend- 
ing on a continuous variable). Usually, the data is observed with random 
noise: 

( 1 ) x ij — .Xj (tij ) + Eij , j — 1 , . . . , vn j , i — 1 , . . . , n , 

where {Xi(t)} are i.i.d. trajectories of the stochastic process of interest, 
{tij} are the time points where the trajectories are measured, and {£ij} are 
independent random errors. It is known [see, e.g., Gohberg, Goldberg and 
Kaashoek (2003)] that a stochastic process X £ L 2 ([a, b]) with J £7( 1 1 1 1 2 ) < oo 
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admits the expansion (known as Karhunen-Loeve decomposition) 

oo 

(2) x(t) = fi(t) + Y,ykMt)i 

k=l 

where p(t) = ~E{X(t)}. The (frkS form a nonrandom orthonormal basis of 
L 2 ([a, b]) and the y^s are uncorrelated random variables with zero mean and 
finite variance. If p(s,t) = cov{X(s), X(t)}, we have the representation 

oo 

(3) p(s,i) = ^Ajfc<£fe(s)0jfc(i), 

fc=i 

where A^ = var(y/ c ). If p(s, t) is continuous, then the <^s are also continuous 
and the series (3) converges uniformly and absolutely. This representation 
implies that A/% is an eigenvalue of p with eigenfunction <pk, so the 0&S are 
called "principal components" and the y^s "component scores," in analogy 
with multivariate analysis. 

To a large extent, the stochastic process X(t) is characterized by p(t) and 
p(s,t). Estimating these functions is challenging when the time grid {tij} is 
irregular or sparse, because it makes individual smoothing of the trajectories 
very hard or even impossible (mj may be as low as 1 or 2 for some indi- 
viduals). Some authors that have addressed this problems are Staniswalis 
and Lee (1998), Yao, Miiller and Wang (2005), James, Hastie and Sugar 
(2000), Gervini (2006), and Yao and Lee (2006). These estimators, how- 
ever, cannot handle outlying curves like those in the examples of the In- 
troduction. Estimators that do handle outlying curves were proposed by 
Locantore et al. (1999), Fraiman and Muniz (2001), Cuevas, Febrero and 
Fraiman (2007), and Gervini (2008), but they can only be applied to indi- 
vidually smoothed trajectories. Estimators that are able to handle outlying 
curves and can be computed on sparse and irregularly sampled data have 
not yet been proposed. We present one possible approach in the next sec- 
tion. 

3. Reduced-rank t-models. The eigenvalues A& in (3) typically decrease 
to zero very fast, because J2T=i Afc < °°- Therefore, only the leading terms 
in (2) are of practical relevance, and we can assume 

d 

(4) x(t) = p{t) + Y J yk4> k (t) 

k=l 

for some d, where Ai > • • • > A^ > 0. Smoothness of p and the (^s can be 
built into the model by assuming they are spline functions. That is, we 
assume p(t) = 9 T h(t) and <j>k(t) = rf£h(t), where h(t) £ MP is a spline basis. 
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The observational model implied by (1) and (4) can be succinctly expressed 
as 

(5) x i = B i © + B i HA 1 / 3 z i + (re i , i = l,...,n, 

where B; = [bk{Uj)\j )k ), H= [rj x , . . . ,rj d ], A = diag(Ai, . . . , X d ), z, is the vec- 
tor of standardized component scores and £j are the standardized measure- 
ment errors. If we assume a heavy-tailed distribution for (zj,£j), outlier- 
resistant estimators of fi and the are obtained automatically. The reason 
is that, informally speaking, heavy-tailed models "expect" extreme observa- 
tions, which are then downweighted by the maximum likelihood estimation 
process. 

Specifically, we assume that (zj,£j) has a joint multivariate t distribu- 
tion with v degrees of freedom, location parameter and scatter matrix 
Id+mv which we denote by t v (0,l d+m .). Then x* ~ t v (Bi0, SJ, with Ej = 
B 4 HAH T Bf + u I mi . The maximum likelihood estimating equations for 
this model, which are derived in the Technical Report, are the following: 

<•> E(^f)Bm- i (x,-B i( ,)=o, 

(7) (I d -JHH T )S n r7 fc = 0, k = l,...,d, 

(8) T]lS n r] k = 0, k = l,...,d, 

0) - ^E^r 1 ) + \ E f^ 1 ) (^-B^f et-^-b^ = o, 

j=l j=l ^ * ' 

where 

Si = (x i -B i 0) T 5]r 1 (x i -B i 0) and J = [J bi(t)bj(t) dt] {iJ) . The best linear 
predictor of Zj is E(zj|x i ) = A 1 / 2 H T B^5]r 1 (x i — Bj0), and z, is obtained by 
replacing the model parameters with their estimators. 

What makes these estimators robust are the weights {y + rrii)/{v + Sj) 
that appear in equations (6)-(9). Since Sj is the squared Mahalanobis dis- 
tance between x, and the expected trajectory Bj#, atypical trajectories are 
downweighted and do not seriously affect the estimators. Downweighting is 
strongest for the Cauchy model (y = 1) and becomes less pronounced as v 
increases. When v — > oo, (y + mi)j(y + Sj) — > 1 and one obtains the esti- 
mating equations for the Normal reduced-rank model [James, Hastie and 
Sugar (2000)], which gives equal weight to all sample curves and then lacks 
robustness. 
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These estimators can be easily computed via the EM algorithm, which 
is derived in detail in the Technical Report. The recursive steps are the 

following: given current estimates 6 , H (where 3 = HA 1 / 2 ) and (<7 2 ) old , 
the updates are 



-new 




vec(c ) 



i=l 
n 



V + mi \ „ \d,~ \d\T I „ nTi 



2 \ncw 



t=l 
1 



zy + s? ld 



old, 



,i=l 



z/ + s 



old 



" + m ^lx,-B. ; r ld -B,-3 ol V d || 2 



+ ^trace{B J S old (V° ld )- 1 ( H ° ld ) r Bf} 



i=l 

where Sj and Zj are as before, and Vj = 1^ + H T B^BjE/<7 2 . To obtain H 

and A from H, we find the spectral decomposition of 3 JH, say, UDU 
with U orthogonal and D diagonal, and set A = D and H = HUD~ 1//2 . 

As it is well known, the EM algorithm can take a large number of it- 
erations to converge; but for our estimators each iteration is very fast to 
compute. Most of the computing time (in our Matlab implementation) is 
taken up by the recomputation of the spline basis matrix B.; for each i 
on each iteration, so the computing time grows mostly with n and only 
marginally with d, p or the m^s. To give an idea of the computing times 
involved, each run of the EM algorithm for the simulated data in Section 5, 
with n = 100, takes approximately 15 seconds on a common laptop computer 
with a 2.00GHz Intel Pentium processor. 

In practice, the model dimension d is not known a priori, so the computa- 
tion of the estimators is done in a sequential way. We recommend to begin 
with a mean-only model {d = 0), using 6 = and a 2 = Yli j x 1j/ Yli m i as 
initial estimators for the EM iterations. Then proceed by adding one prin- 
cipal component at a time, using the estimators of the previous (d — 1)- 
dimensional model as initial estimators for the d-dimensional model. The fi- 
nal dimension do can be- chosen subjectively or objectively. Subjective ap- 
proaches include choosing a d that yields a small ratio A^/ X^fc=i or a small 
value of Arf compared to the noise variance a 2 . Objective model selection 
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methods can be based on the maximization of the penalized log-likelihood 

n 

(io) ^io g f( Xl \d,n,A)-c n df, 

i=l 

where /(xj|0,H,A) is the ^(Bj0,Sj) density evaluated at Xj, df are the 
degrees of freedom of the model, and c n is a constant. Concretely, c n = 1 
defines the AIC criterion and c n = logn/2 the BIC criterion. This approach 
has been used in the functional data context [Yao, Miiller and Wang (2005)] 
for normally distributed data only, but Shen, Huang and Ye (2004) justify 
their use for exponential distributions in general. The degrees of freedom of 
the model are the number of parameters minus the number of orthonormality 
restrictions. Another objective method that can be used, although in practice 
it tends to underperform (10), is cross-validation. Cross-validation would 
maximize 

n 

where ^-(-i) an d ^(-i) are the estimators computed without observa- 

tion Xj. 

Another aspect that is rather subjective is the choice of basis functions for 
/j,(t) and 4>k(t), particularly the knot placement and quantity. If a large num- 
ber of knots is used, placement becomes less important but regularization 
is necessary. This can be accomplished by adding roughness penalty terms 
of the form a f {//'(t)} 2 dt and o.k f {<Afc(i)} 2 dt to the log-likelihood function 
(the resulting modifications of the EM algorithm are straightforward, since 
these terms are quadratic in the parameters). Selection of the smoothing 
parameters a and can be done, again, either subjectively or objectively. 
The penalized log-likelihood approach, however, is not as straightforward to 
implement as before, because the degrees of freedom of the model are not 
as easy to calculate when the fitted values Xj are not linear functions of the 
data [Ye (1998); Efron (2004)]. Cross-validation, on the other hand, can be 
implemented as easily as usual despite its shortcomings. Nevertheless, since 
and the ri k s are model parameters common to all curves, the estimators 
6 and {r/fc} "borrow strength" across individuals and then the choice of 
smoothing parameters is less problematic than if each curve were smoothed 
individually. This is based on our experience with spline smoothing rather 
than on formal mathematical results, although for kernel smoothers Yao, 
Miiller and Wang (2005) and Kneip (1994) have indeed established rates of 
convergence of the estimators and the bandwidths that depend fundamen- 
tally on the number of curves n rather than the number of observations per 
curve rrii. 
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4. Asymptotic properties. The distributional assumptions made in Sec- 
tion 3 were just working assumptions to derive robust estimators of fj,(t) and 
the (fik(t)s. In this section we will study the consistency of the estimators 
under broader conditions. We will also study their sensitivity to outliers, as 
quantified by the influence function. 

To simplify, let us assume that the individual time grids tx,...,t n are 
i.i.d. realizations of a random vector t S M m , so mj = m for all i. Let 
w = (t,x) and let us collect all model parameters in a single vector £ = 
(Q,T}i,..., r] d , Ai, . . . , Xd, o 2 ). The estimating equations (6) to (9) can be ex- 
pressed as a single system of equations Si l =l' , / , ( w i>£) = for an appro- 
priate function V(w, •) : -> ]R( p+1 )( d+1 ). Estimators of this type 
are called M-estimators, or sometimes Z-estimators [Maronna, Martin and 
Yohai (2006), Chapter 3; Van der Vaart (1998), Chapter 5]. For such esti- 
mators the notion of Fisher consistency is useful. Suppose w = (t,x) follows 
model (5) with parameter £ = £ , and let Fq be the resulting distribution of 
w. Let £ = £(-Fb) be the solution to the eqnarray Ei? {?/>(w, £)} = 0. In prin- 
ciple, £(-Fb) need not be equal to the true model parameter £ ; if it is, the 
estimator is said to be Fisher consistent [Maronna, Martin and Yohai (2006), 
page 67]. It turns out that under some regularity conditions, M-estimators 
£ converge in probability to (,(Fq) as n goes to infinity; then, under those 
regularity conditions, Fisher consistency implies the usual consistency [Van 
der Vaart (1998), Theorem 5.9]. 

The next theorem shows that 9 and the f) fc s are Fisher consistent under 
broad conditions, whereas a 2 and the A^s are off by a common factor [this 
is typical of M -estimators of scale parameters; see Maronna, Martin and 
Yohai (2006), Chapter 6.12]. 

Theorem 1. If w = (t,x) follows model (5) with parameter £ 0) and 
(z,e) has a joint spherical distribution, then ^(Fq) = (6q,t] 01 , . . . ,rj od , AAoi, 
. . . , AAod) /3o°~o) with (3o > a factor that depends on the distribution of 
(z,e) and on v but not on the model parameter £ . 

Theorem 1 implies that the estimators of fj, and the cj>kS derived under a 
t v distributional assumption on (z,e) are actually Fisher consistent under 
any spherical distribution of (z, e), including the Normal distribution or a 
t v * distribution with u* ^ v. The estimators of a 2 and the A^s, although 
not Fisher-consistent, are off by a common factor {3o, which implies that the 
ratios A^/o -2 and Xd/Ylt=i are Fisher-consistent. 

Now we turn our discussion to the outlier sensitivity of the estimators. 
Outlier sensitivity can be measured by the influence function [Maronna, 
Martin and Yohai (2006), Chapter 3], which is defined as 

IF(w; I F ) = lim -{£((1 - e)F + e5 w ) - Z(F Q )}, 
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where <5 W is the point-mass distribution at w. The gross-error sensitivity 
of £ is defined as 7* = sup w || IF(w;£, Fq)\\. Note that for a small contam- 
ination proportion e, the asymptotic bias caused by 5 W is approximately 
eIF(w;^,Fo). Therefore, if 7* < 00, the bias is bounded regardless of the 
location of the outliers and the estimator £ is said to be locally robust. 

For regular M-estimators, it can be shown that IF(w; £, Fq) = — M _1 i/>(w, 
£(F )), where M = Ei? o {<9'0(w,£)/<9£' r |£ = £( Fo )} [Maronna, Martin and Yohai 

(2006), Chapter 3]. Then 7* < A m f(MM r ) su Pw ||^(w,£(F ))||, where 
Amin(A) denotes the smallest eigenvalue of A, so 7* < 00 as long as M 
is invertible and "0( w >£) is bounded in w. This is true for our estimating 
functions if), so the t-model estimators £ are locally robust. 

Influence functions are also useful for the computation of asymptotic vari- 
ances. Under appropriate regularity conditions, \/n(£ — $,(Fq)) converges in 
distribution to a 2V~(0,V) with V = E{IF(w; £, F Q ) IF(w; £, F ) T } [Van der 
Vaart (1998), Theorem 5.21]. This result is useful, for instance, to derive 
asymptotic confidence bands for n(t), provided one can obtain a more ex- 
plicit expression for the px p block of V that corresponds to the asymptotic 
variance of 0. In some cases this is possible, as the next theorem shows. 



Theorem 2. Ifw = (t,x) follows model (5) and (z,e) has a joint spher- 
ical distribution, then M has a block structure 



M 



Mn 






M22 



with Mn G K pxp given by 



where 



Mn=E Fo | 5 (w)B T (t)^-^ 1 (t)B(t)|, 

2 {v + m)s(w)/ v + m 



#(w) 



m {u + s(w)//3 } 2 v + a(w)/A) ' 

B(t) = [6 fc (t y )] (i)fe) , S(t) = B(t)H A H^B T (t) + C j2L m , s (w 
S _1 (t){x — /x (t)} and £t (t) = B(t)#o- Furthermore, 



IF(w;0,F o ) = -M n 1 



v + m 



{x-/x (t)} T x 
B T (t)-^-£ _1 (t){x - /x (t)}. 



From Theorem 2 we see that the asymptotic covariance matrix of 9 has 
the form M^AMy, and due to the block structure of M, 6 is asymptot- 
ically independent of {fjj.}, {^k} an d <J 2 . The matrices Mn and A can be 
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estimated by 

1 n 

Mi^-V&BfS^Bi 
n z — ' 

i=l 

and 

-i n / i \ 2 

a=- e ( ^ ) - ax* - Af=r a,, 

1=1 

where 

2(z^ + m)sj v + m 

Note that, by Theorem 1, and s, are consistent estimators of s(wj)//3o 
and 5] _1 (tj)//?o, so M^ 1 AMj"/ is a consistent estimator of Mj/AM^ 1 . 

5. Simulation study. 

5.1. Assessment of parameter estimators. We studied the finite-sample 
behavior of the estimators by simulation. We were mainly interested in the 
relative efficiency of the estimators for normally distributed data and in 
their bias under outlier contamination. Three estimators were considered: 
the maximum likelihood estimator for (a) the Normal model [James, Hastie 
and Sugar (2000)], (b) the Cauchy model, which is a i-model with v = 1, and 
(c) the 4-model with v = 5. As spline basis we chose cubic splines with five 
equidistant knots. We considered different simulation scenarios (described 
below) but only part of the results are reported here (Table 1). The rest can 
be found in the Technical Report. 

To assess the efficiency of the estimators, we simulated data from the 
two-component model 

2 

(11) Xij = fi(Uj) + y^ J z ik ^/X^(j) k (ti j ) + aeij, 

k=l 

with fj,(t) = and <fik(t) = y/2s'm(kTrt), for t G [0,1]. The component scores 
and the random errors were independent iV(0, 1) and Ai = 1, A2 = 
0.5, a 2 = 0.25. Three desi gns were considered for the tijs: (i) m = 20 fixed 
uniformly spaced points in [0, 1], (ii) m = 20 random points (which vary from 
curve to curve) with uniform distribution in [0, 1] , and (iii) mi random points 
with uniform distribution in [0, 1], where mi, . . . ,m n was a sample from a 
Poisson random variable with mean 15. The third design is the one that best 
resembles sparse and irregularly observed data. As sample sizes we took n = 
50, n = 100 and n = 200. Each sampling situation was replicated 500 times. 
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Table 1 

Simulation results. Root mean squared errors of different estimators for noncontaminated 
normal data and outlier- contaminated data 







No 


Endogenous contam. 


Exogenous contam. 


Estim. 


Model 


contam. 


10% 


20% 


30% 


10% 


20% 


30% 


1' 


Normal 
Cauchy 

ts 


0.142 
0.169 
0.159 


0.427 
0.190 
0.183 


0.819 
0.247 
0.254 


1.205 
0.330 
0.365 


0.367 
0.162 
0.153 


0.703 
0.184 
0.179 


1.040 
0.212 
0.224 


k 


Normal 
Cauchy 
ts 


0.142 
0.165 
0.163 


1.091 
0.299 
0.338 


1.331 
0.627 
0.673 


1.363 
1.006 
1.087 


0.942 
0.158 
0.152 


1.265 
0.189 
0.183 


1.290 
0.220 
0.232 



Root mean squares of ||/t — n\\ and \\4>i — 4>i\\ are given in Table 1, for grid 
design (ii) and sample size n = 100. The relative behavior of the estimators 
is similar for the other designs and sample sizes, as can be seen in the more 
detailed results shown in the Technical Report. We see that the i-model 
estimators are generally less efficient than the Normal- model estimators, 
as expected, but the loss of efficiency is minimal. We also note that the 
estimators ft were obtained by fitting a mean-only model to the simulated 
data, whereas the estimators 4>i where obtained by fitting a one-component 
model to the data; therefore, the models were always underspecified, but 
this did not seem to affect the consistency of the estimators (the boxplots 
in the Technical Report show that the errors decrease as n increases). 

To assess the robustness of the estimators, two types of outliers were 
considered; we call them endogenous and exogenous. Endogenous outliers 
are curves that belong to the space spanned by {01,02} just like the rest 
of the data, only that the component scores follow a different distri- 
bution. Exogenous outliers, on the contrary, are curves that do not be- 
long to the space spanned by {0i,02j-. In these simulations we generated 
exogenous outliers by taking linear combinations of (j>±, 4>2 and 4>3(t) = 
c{t(l - t)} 1 ' 2 sin{27r(l + 2^~^^)/{t + 2( 9 " 4fc V 5 )} with k = 5 (the so-called 
"Doppler function," with c such that \\4>3\\ = 1). Three contamination pro- 
portions were considered for the scenarios described below: e = 0.10, e = 0.20 
and e = 0.30. The time grid was generated following the uniform random de- 
sign (ii), and the sample size was n = 100. Each scenario was replicated 500 
times. 

Let us first examine the robustness of ft. Endogenous outliers were gen- 
erated by replacing en component scores zn with a large constant K (so 
the outlying curves were virtually identical to K\f\\(j)i ) , whereas exogenous 
outliers were generated by adding Ky/Xifo to en sample curves. We consid- 
ered two contaminating constants, K = 4 and K = 8, but only the results 
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for K = 4 are reported in Table 1 (the results for K = 8 are given in the 
Technical Report). Since the "true" mean for these samples are eK(j>i and 
eK4>z, respectively, because Ai = 1, the root mean squared errors should be 
approximately eK for nonrobust estimators. This is exactly what we see 
in Table 1 for the Normal-model estimator. In contrast, i-model estimators 
show remarkably low biases, even for contamination proportions as high as 
30%. 

To study the robustness of cfti, endogenous outliers were generated by 
replacing en/2 scores Zi 2 with K\/X~2 and en/ 2 with —Ky/X^; exogenous 
outliers were generated by adding Ky/Xifo to en/2 sample curves and sub- 
tracting the same quantity to other en/2 sample curves. As before, we used 
K = 4 and K = 8 but only report the case K = 4 here, since the other re- 
sults are similar. Note that these symmetric contaminations affect <p\ but 
do not affect fi, because they alter the covariance structure without chang- 
ing the mean. In fact, the endogenously contaminated data follows model 
(11) with A* = (1 - e)Ai and X* = (1 - e)X 2 + eX 2 K 2 , so A2 can be actu- 
ally larger than A* if K is big enough, in which case we expect the root 
mean squared error of a nonrobust estimator to be close to ||</>2 — <fii || = V2- 
This is what we observe in Table 1. The exogenously contaminated data 
also follows model (11) with three components, but the components are not 
01, <p2 and 03 (because they are not orthogonal). We see in Table 1 that 
endogenous outliers have a more deleterious effect on the estimators than 
exogenous outliers. In fact, the t-model estimators are practically unaffected 
by exogenous outliers. Under endogenous contaminations, the performance 
of the i-model estimators deteriorates for large contamination proportions, 
although they still outperform the Normal-model estimators. 

Overall, the conclusion from this Monte Carlo study is that t-model esti- 
mators are highly resistant to outliers, even for relatively large contamina- 
tion proportions, and have a high relative efficiency for Normal data. Given 
that their computational complexity is comparable to that of Normal-model 
estimators, we think that they are a practical and safer alternative. In par- 
ticular, we recommend the use of Cauchy-model estimators, since they are 
the most robust in the t family and are not much less efficient than 4s-model 
estimators for Normal data. 

5.2. Assessment of model selection criteria. We also ran a Monte Carlo 
study to evaluate the performance of the AIC and BIC criteria for selection 
of the model dimension d. We generated data from the two-component model 
(11) and from a symmetrically contaminated model with exogenous outliers, 
as explained above. Since exogenous contamination introduces a spurious 
third direction of variability, the expected effect on the AIC and BIC criteria 
is an overestimation of the model dimension. 
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We compared two types of estimators: the Normal- model estimators and 
the Cauchy-model estimators. As before, we chose cubic splines with five 
equidistant knots as spline basis; then p = 9 and, for the true model with 
d = 2, the degrees of freedom are 27. We considered two sample sizes, n = 20 
and n = 60 (note that in the former case n is less than the degrees of freedom 
of the true model). The models were fitted in a sequential way, as suggested 
in Section 3, from d = to d = 4. Each sampling situation was replicated 
300 times. 

The results are summarized in Table 2. We show two outputs: the percent- 
age of the samples for which the right model is selected and the percentage 
of the samples for which the next model {d = 3) is selected; the remaining 
percentage would correspond to the four-dimensional model, since we ob- 
served that models with d < 2 were never selected. For noncontaminated 
data, it is clear that the criteria have no trouble selecting the right model, 
neither for Normal nor for Cauchy estimators. For low contamination levels 
(e = 0.10) the AIC and BIC based on Cauchy estimators select the right 
model in the vast majority of cases, with the BIC criterion being clearly 
superior; the nonrobust Normal estimators, in contrast, almost never led to 
the right choice of model. For larger contamination proportions, even the 
robust estimators break down; but even then we note that the BIC based on 
Cauchy-model estimators outperforms the alternatives, since it selects the 
slightly overspecified model d = 3 most of the time and very rarely leads to 
the worst choice d = 4, in contrast to the other methods. All things con- 
sidered, we think the BIC based on Cauchy estimators is a recommendable 
criterion for selection of the number of components. 



Table 2 

Simulation results. Percentage of times AIC and BIC select a two-component model and 
a three- component model, for Normal and Cauchy estimators and several contamination 

proportions 



Contamination proportion 



n 


Method 


0% 


10% 


20% 


30% 


20 


AIC-Nor 


(99,1) 


(1,74) 


(3,65) 


(6.3,62.3) 




BIC-Nor 


(99.7,0.3) 


(1,79.3) 


(4.7,71.3) 


(11.3,66.3) 




AlC-Cau 


(98,2) 


(74,24.3) 


(12.3,83.7) 


(0,87.3) 




BIC-Cau 


(99.7,0.3) 


(84.7,15.3) 


(20.7,78) 


(0.3,94.7) 


60 


AIC-Nor 


(100,0) 


(0.3,60.7) 


(0.3,53.7) 


(1.3,62) 




BIC-Nor 


(100,0) 


(0.3,62.3) 


(0.3,55.3) 


(2,66) 




AlC-Cau 


(100,0) 


(79.3,20) 


(0.3,76) 


(0,67.7) 




BIC-Cau 


(100,0) 


(89.3,10.7) 


(1,94.7) 


(0,92) 
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6. Examples. 

6.1. Internet traffic. Accurate modeling of Internet traffic data is essen- 
tial for an efficient allocation of computational resources. In this section we 
show that just a couple of atypical curves can lead to seriously misleading 
results. The data, previously analyzed by Zhang et al. (2007), was collected 
at the main Internet link of the University of North Carolina during seven 
consecutive weeks (from June 9 to July 25, 2003). The traffic is measured 
in packet counts, every half hour. The logarithm of the data for week days 
is shown in Figure 1(a) and for weekend days in Figure 1(b). 

Although the data is very noisy, we see that the trajectories follow a 
regular pattern, which is different for week days than for weekends. Here we 
analyze only the 35 week days. There is a very clear outlier in Figure 1(a), a 
curve that actually looks like a weekend trajectory. This curve corresponds 
to the Fourth of July. A more subtle atypical curve corresponds to June 27, 
the second day of classes and the last day for late registration for the Second 
Summer Session. That day the traffic peaked two hours earlier than usual, 
and also decreased earlier than usual in the afternoon. 

We estimated the mean and the first two principal components using 
Normal-model and Cauchy-model estimators based on cubic splines with 10 
equispaced knots. The results are shown in Figure 3. Rather than plotting 
the principal components themselves, we show their effect on the mean, by 
plotting £l plus/minus a constant times 4>k- This makes interpretation easier. 
We see that the mean estimators are similar but the principal components 
are completely different. The Normal-model estimator of the first component 
is an amplitude effect (above/below the mean, but parallel to it) and the 
second component is a shape component (traffic higher than the mean until 
3 p.m. and lower than the mean afterward). The first component is clearly 
dominant, since Ai = 0.329 and A2 = 0.085. It is very suspicious that these 
components essentially mimic the two outliers; in fact, July 4 has the largest 
first component score and June 27 the largest second component score. 

On the other hand, the Cauchy-model estimators of the components ex- 
plain amplitude variability at the end of the day (the first component) and 
at the beginning of the day (the second component), with the total vari- 
ability roughly equally split (A\ =0.176 and A2 = 0.123). Of course, the 
fact that the two methods produce different estimators does not automat- 
ically imply that the Cauchy-model estimators are better, but a residual 
analysis confirms this. Cauchy-model estimators produce smaller residual 
norms ||xj — Xj|| than Normal- model estimators for 25 of the 35 observa- 
tions. The median residual norm for the Cauchy fit is 0.556, while for the 
Normal fit it is 0.592. Figure 4 shows individual predictors and residuals; 
undoubtedly, Cauchy-model estimators offer an overall better fit (except for 
the Fourth of July outlier). Normal- model estimators show a particularly 
poor fit for the Internet traffic between and 6 a.m. 
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(a) (b) 
22.5 I ■ . . ■ 1 I ■ ■ 




5 10 15 20 5 10 15 20 

Time (h) Time (h) 



Fig. 3. Internet traffic data. Estimators of the mean ( ) and the mean plus ( ) 

and minus (— ■—) a constant times the principal component, for Normal-model estimators 
[(&), (c)J and Cauchy-model estimators [(h), (d)] of the first [(a,), (b)J and second [(c), 
(A)] principal components. 



6.2. Child diabetes study. Glycated hemoglobin (HbAic) levels are often 
used as a measure of average plasma glucose concentration over certain peri- 
ods of time. Figure 2 shows trajectories of Hb^ic levels for diabetic children 
who underwent treatment at the Children's Hospital of the University of 
Zurich. The profiles are very irregularly sampled and noisy. For girls, the 
minimum number of observations per trajectory is 2, the median 33 and the 
maximum 55; for boys, the minimum number of observations per trajectory 
is 2, the median 33 and the maximum 56. For such irregular data individ- 
ual smoothing is impractical, even impossible for the shortest trajectories. 
The presence of at least one outlying curve is plain to see in Figure 2(a), 
although it is hard to tell by visual inspection if there are any other outliers. 

We estimated the mean and the principal components for each sex, using 
Normal and Cauchy maximum likelihood estimators. Cubic splines with 6 
equispaced knots were used as basis functions. The BIC based on Normal- 
model estimators selects a three-component model for both sexes, while the 
BIC based on Cauchy-model estimators selects a four-component model; 
however, in the latter case A4 is very small compared to a 2 and the other 
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(c) (d) 




-0.8 1 1 1 1 1 1 1 1 1 1 u - 

5 10 15 20 5 10 15 20 

Time (h) Time (h) 



Fig. 4. Internet traffic data. Estimated trajectories [(&), (h)] and residuals [(c), (A)] 
from Normal-model estimators [(&), (c)] and Cauchy-model estimators [(h), (A)]. 



As, so we settled for a three-component model. Figure 5 shows the estimators 
of the mean and the two leading components (the third one is omitted for 
better visibility). For males, both methods produce similar estimators, but 
for females the differences are striking. The Normal-model estimators not 
only overestimate the mean but also provide a very irregular estimator of 
the first principal component; the estimator of the second component is also 
substantially different from the Cauchy-model estimator. 

The trajectory with the largest mean squared residual for girls is shown in 
Figure 6. This is a patient whose diabetes level was clearly out of control. We 
see that the Normal-model estimator provides a somewhat better fit for this 
curve than the Cauchy estimator, but this is at the expense of a poorer fit 
for the rest of the individuals. The mean squared residual of this observation 
is 22.6 for the Normal fit and 24.8 for the Cauchy fit. However, the three 
quartiles of the mean squared residuals for the whole sample are 0.18, 0.40 
and 0.63 for the Cauchy fit, and 0.24, 0.43 and 0.71 for the Normal fit, so 
the Cauchy fit is better overall. Another confirmation of this is that the 
Normal-model estimators obtained after eliminating the outlying trajectory 
are very similar to the Cauchy estimators. 
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(a) (b) 




-0.6 1 1 1 1 1 1 1 1 1 ' ' 1 1 1 1 1 

4 6 8 10 12 14 16 18 4 6 8 10 12 14 16 18 

Age (years) Age (years) 



Fig. 5. Child diabetes data. Estimated means [( a), (b)J and leading principal components 
[(c), (A)] of HbAic trajectories for females [(&), (c)J and males [(h), (A)], using Normal 
(dashed line) and Cauchy (solid line) maximum likelihood estimators. 

25 i 1 1 1 1 1 1 1 




4 6 8 10 12 14 16 18 

Age (years) 



Fig. 6. Child diabetes data. Outlying trajectory (dotted line), Cauchy-model estimator of 
the mean (thick solid line), and fitted trajectory using Cauchy-model predictor (thin solid 
line) and Normal-model predictor (dashed line). 
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7. Conclusion and discussion. As we have shown in Section 6, outlying 
curves do occur in longitudinal and functional datasets. When individual 
smoothing is feasible, they can be handled by the robust methods alluded to 
in Section 2. But when the data is sparse and irregular, individual smoothing 
is unfeasible and methods that employ the raw data must be used. One pos- 
sible approach has been presented in this article. The idea of using t models 
to derive robust estimators is not new to Statistics [see, e.g., Lange, Little 
and Taylor (1989)], but those procedures were specifically developed for low 
dimensional multivariate data. They cannot be applied "off the shelf" to 
functional or longitudinal data, where the dimension of the covariance ma- 
trix often exceeds the sample size. However, an adaptation of the reduced- 
rank approach of James, Hastie and Sugar (2000) provides a way to imple- 
ment i-model estimators in the functional data context. The approach we 
have followed is the simplest one, which is to assume that (zj,£j) in (5) is 
jointly t distributed, and, as a result, the x^s themselves have a multivariate 
t distribution. But other approaches are possible. For instance, it could be 
assumed that Zj and e$ have multivariate t distributions but are indepen- 
dent, or even that each has an independent t distribution. Unfortunately, 
none of these assumptions imply that the XjS have a multivariate t distribu- 
tion, which complicates the theoretical study of the estimators' properties 
and the derivation of the EM algorithm. Nevertheless, these alternatives are 
worth further research. 

SUPPLEMENTARY MATERIAL 

Technical Report and Matlab code (DOI: 10. 1214/09- AOAS257SUPP; 
.zip). The pdf file contains proofs, technical derivations and more detailed 
simulation results not given in the paper. The zip file contains Matlab pro- 
grams implementing the EM algotihm for Normal and t reduced-rank mod- 
els. 
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